Current Issue: January-March 2021, Volume 2021, Issue Number 1, Articles: 5
To address the shortcomings of single-network classification models, this paper applies a CNN-LSTM (convolutional neural network - long short-term memory) combined network to music emotion classification and proposes a multifeature combined network classifier based on CNN-LSTM. The classifier combines 2D (two-dimensional) feature input processed by the CNN-LSTM with 1D (one-dimensional) feature input processed by a DNN (deep neural network), making up for the deficiencies of the original single-feature models. The model uses multiple convolution kernels in the CNN for 2D feature extraction and a BiLSTM (bidirectional LSTM) for sequence processing, and it produces single-modal emotion classification outputs for audio and for lyrics, respectively. For audio feature extraction, the music audio is finely segmented and the vocals are separated to obtain pure background-sound clips, from which the spectrogram and LLDs (low-level descriptors) are extracted. For lyrics feature extraction, chi-squared-test vectors and word embeddings extracted by Word2vec are used, respectively, as feature representations of the lyrics. Combining these two types of heterogeneous features selected from audio and lyrics in the classification model improves classification performance. To fuse the emotional information of the two modalities, audio and lyrics, the paper proposes a multimodal ensemble learning method based on stacking. Unlike existing feature-level and decision-level fusion methods, this method avoids the information loss caused by direct dimensionality reduction: the original features are converted into label results before fusion, effectively solving the problem of feature heterogeneity. Experiments on the Million Song Dataset show that the audio classification accuracy of the multifeature combined network classifier reaches 68% and the lyrics classification accuracy reaches 74%. The average multimodal classification accuracy reaches 78%, a significant improvement over the single-modal results.
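The stacking fusion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the data are synthetic stand-ins for audio and lyrics features, the base models are plain logistic regressions rather than CNN-LSTM/DNN networks, and all names are made up for the example. The key idea shown is that each modality's out-of-fold label predictions, not its raw heterogeneous features, are what the level-1 meta-learner fuses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-ins for heterogeneous audio and lyrics feature matrices.
n = 200
y = rng.integers(0, 2, n)                              # illustrative binary emotion label
X_audio = rng.normal(size=(n, 16)) + y[:, None] * 0.8
X_lyrics = rng.normal(size=(n, 8)) + y[:, None] * 0.6

# Level-0: one classifier per modality. Out-of-fold predicted probabilities
# stand in for the "label results" that stacking fuses, avoiding direct
# concatenation (and dimensionality reduction) of heterogeneous raw features.
audio_clf = LogisticRegression(max_iter=1000)
lyrics_clf = LogisticRegression(max_iter=1000)
p_audio = cross_val_predict(audio_clf, X_audio, y, cv=5, method="predict_proba")[:, 1]
p_lyrics = cross_val_predict(lyrics_clf, X_lyrics, y, cv=5, method="predict_proba")[:, 1]

# Level-1: a meta-learner fuses the two single-modal outputs.
Z = np.column_stack([p_audio, p_lyrics])
meta = LogisticRegression(max_iter=1000).fit(Z, y)
acc = meta.score(Z, y)
print(f"fused accuracy on synthetic data: {acc:.2f}")
```

Because the meta-learner only ever sees two probability columns, each modality can use whatever feature representation suits it best.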
This paper studies the segmentation and clustering of speaker speech. To improve the accuracy of speech endpoint detection, the traditional double-threshold short-time average zero-crossing rate is replaced by a more robust spectral centroid feature, thresholds are selected from the local maxima of the histogram of the statistical feature sequence, and a new speech endpoint detection algorithm is proposed. Compared with the traditional double-threshold algorithm, it effectively improves detection accuracy and noise robustness at low SNR. The conventional k-means clustering algorithm requires the number of clusters to be given in advance and is strongly affected by the choice of initial cluster centers, while the self-organizing neural network algorithm converges slowly and cannot provide accurate clustering information. An improved k-means speaker clustering algorithm based on a self-organizing neural network is therefore proposed: the number of clusters is predicted from the winning pattern of the competitive neurons in the trained network, and the neuron weights are used as the initial cluster centers for the k-means algorithm. Experimental results on multispeaker mixed-speech segmentation show that the proposed algorithm effectively improves the accuracy of speech clustering and makes up for the shortcomings of both the k-means and self-organizing neural network algorithms.
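The seeding idea in this abstract, estimate the cluster count from winning neurons and reuse their weights as k-means initial centers, can be sketched in NumPy. This is a simplified competitive layer (no neighborhood function, unlike a full SOM) on synthetic 2D "speaker embeddings"; the 5% win threshold and all parameters are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic embeddings: three well-separated clusters stand in for three speakers.
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(60, 2)) for c in true_centers])

# 1) Competitive (SOM-style) layer with more neurons than expected speakers.
n_neurons, lr = 8, 0.3
W = rng.normal(size=(n_neurons, 2))
wins = np.zeros(n_neurons, dtype=int)
for epoch in range(20):
    for x in rng.permutation(X):
        j = np.argmin(np.linalg.norm(W - x, axis=1))   # winning neuron
        W[j] += lr * (x - W[j])                        # move winner toward sample
        if epoch == 19:
            wins[j] += 1                               # count wins on the final pass

# 2) Frequently winning neurons predict the cluster count; their weights seed k-means.
active = wins > len(X) * 0.05                          # illustrative 5% threshold
k = int(active.sum())
centers = W[active].copy()

# 3) Standard k-means iterations from the network-derived initialization.
for _ in range(20):
    labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
    centers = np.array([X[labels == m].mean(axis=0) if np.any(labels == m)
                        else centers[m] for m in range(k)])

print("estimated number of speakers:", k)
```

Dead neurons that never win simply fall below the threshold, which is how the method avoids having to fix k in advance.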
The paper's purpose is to design and program a four-operation calculator that receives voice instructions and executes them, returning the result in either voice or text form. The calculator simulates the work of a compiler. The paper is a practical example, programmed to demonstrate that it is possible to construct a verbal compiler.
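The compiler analogy can be illustrated with a tiny interpreter: a lexer that maps spoken-style words to tokens, and an evaluator that runs one of the four operations. This sketch assumes a speech-to-text front end has already produced the phrase; the grammar and word list here are invented for the example and are not the paper's actual design.

```python
# Token tables: number words and the four operations (illustrative vocabulary).
WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
OPS = {"plus": lambda a, b: a + b,
       "minus": lambda a, b: a - b,
       "times": lambda a, b: a * b,
       "divided_by": lambda a, b: a / b}

def evaluate(utterance: str) -> float:
    """Lex a phrase like 'three plus four' and run the named operation."""
    # Normalize the two-word operator so the phrase splits into 3 tokens.
    tokens = utterance.lower().replace("divided by", "divided_by").split()
    a, op, b = WORDS[tokens[0]], tokens[1], WORDS[tokens[2]]
    return OPS[op](a, b)

print(evaluate("three plus four"))       # 7
print(evaluate("eight divided by two"))  # 4.0
```

The lexing/evaluation split mirrors the front end and back end of a conventional compiler, which is the point the paper argues in miniature.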
With the development of virtual scenes, the degree of simulation and the functionality of virtual reality have become very complete, providing a new platform and perspective for instructional design. First, a hidden Markov model is used to perform emotion recognition on English speech signals. English speech emotion recognition and speech semantic recognition are essentially the same task, and hidden Markov models have been widely used in English speech semantic recognition. Experiments on feature extraction and pattern recognition of speech samples show that the hidden Markov model achieves a higher recognition rate and better recognition performance in speech emotion recognition. Second, combining a human pronunciation model with a hearing model and analyzing the influence of glottal features on human-ear hearing-model features, an English speech recognition and emotion interaction simulation system is proposed in which glottal features compensate the human-ear hearing features. Applied in English speech emotion experiments, the system obtains a high recognition rate and shows excellent performance.
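Classifying an utterance with per-emotion HMMs, as described above, amounts to scoring the observation sequence under each model with the forward algorithm and picking the most likely one. The sketch below uses two tiny discrete-emission HMMs whose parameters and emotion labels are invented for the example; real systems would train continuous-emission models on acoustic features.

```python
import numpy as np

def log_forward(log_pi, log_A, log_B, obs):
    """Log-domain forward algorithm: returns log P(obs | HMM)."""
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return np.logaddexp.reduce(alpha)

def hmm(pi, A, B):
    """Pack initial, transition, and emission matrices in log space."""
    return tuple(np.log(np.asarray(m)) for m in (pi, A, B))

# Two illustrative 2-state HMMs over 3 quantized acoustic symbols, one per emotion.
models = {
    "happy": hmm([0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]],
                 [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]),
    "sad":   hmm([0.5, 0.5], [[0.6, 0.4], [0.3, 0.7]],
                 [[0.1, 0.3, 0.6], [0.2, 0.3, 0.5]]),
}

obs = [0, 0, 1, 0, 1]          # a quantized feature sequence favoring "happy" symbols
scores = {name: log_forward(*m, obs) for name, m in models.items()}
pred = max(scores, key=scores.get)
print("predicted emotion:", pred)
```

Emotion recognition and semantic recognition share this machinery, which is why the abstract treats them as essentially the same problem.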
As one of the most important communication tools for human beings, English speech conveys not only literal information but also emotion through changes in tone. Based on the standard particle filtering algorithm, an improved auxiliary unscented particle filtering algorithm is proposed. In importance sampling, the unscented Kalman filter is used, based on the latest observation, to compute each particle's estimate, improving the accuracy of the particles' nonlinear transformation estimates; in the resampling process, auxiliary factors are introduced to modify the particle weights, enriching particle diversity and weakening particle degeneracy. The improved particle filter was applied to online parameter identification and compared with the standard particle filter, the extended Kalman particle filter, and the unscented particle filter in terms of identification accuracy and computational efficiency. A topic model is used to extract a semantic-space vector representation of English phonetic text and to sequentially predict emotional information at the chapter, paragraph, and sentence levels. The system shows reasonable recognition ability for general speech, and the improved particle filter evaluation method is further used to mitigate the high recognition error rate on English speech. Related experiments verify the effectiveness of the method.
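The parameter-identification setting can be sketched with a plain bootstrap particle filter. Note the hedge: the paper's auxiliary and unscented refinements sharpen the proposal and resampling steps, whereas this sketch uses only a random-walk jitter and multinomial resampling; the system model, noise levels, and all parameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative system: y_t = theta * sin(0.1 t) + noise; identify theta online.
theta_true, obs_noise, T = 1.5, 0.2, 100
ts = np.arange(T)
ys = theta_true * np.sin(0.1 * ts) + rng.normal(scale=obs_noise, size=T)

# Bootstrap particle filter over the unknown parameter.
N = 500
particles = rng.uniform(0.0, 3.0, N)                  # prior over theta
for t, y in zip(ts, ys):
    particles += rng.normal(scale=0.02, size=N)       # artificial random-walk dynamics
    # Gaussian likelihood of the new observation under each particle's theta.
    w = np.exp(-0.5 * ((y - particles * np.sin(0.1 * t)) / obs_noise) ** 2)
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)                  # multinomial resampling
    particles = particles[idx]

theta_hat = particles.mean()
print(f"theta estimate: {theta_hat:.2f} (true {theta_true})")
```

Resampling from the raw likelihood is exactly where auxiliary weights and UKF-guided proposals improve on this baseline: they steer particles toward the latest observation before resampling, reducing the degeneracy this simple version suffers from.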